Colloidal gels appear in many applications, such as food processing, cosmetics, and cell biology. Such materials usually exhibit highly non-linear behavior. In this work, the gelation process of a colloidal suspension is studied over time. Two important factors control how long it takes for a gel to form: volume fraction (\(\phi\)) and interparticle attraction (\(D_0\)). In an edible gel, think of the volume fraction as a measure of how much gel powder you add to the water: the more powder you add, the higher the volume fraction. Interparticle attraction, as the name suggests, measures how strongly those powder particles attract each other when suspended in the fluid. There are several questions we aim to answer in this work by examining the data.
Remove previously defined variables, clear the console, and set a seed so that our results are reproducible.
rm(list = ls())
cat("\014")
set.seed(12345)
The data used in this project comes from my own simulations of colloidal gels. The HOOMD-blue package was used to run the simulations, and the NetworkX module in Python was used to generate the particle structure at each time step of the simulation.
gel.data = read.csv('N_Clusters.csv', header = TRUE)
Let’s take a look at the attributes of our data using the names() function. (summary() is examined a bit later, to avoid messy output here.)
names(gel.data)
## [1] "phi0.05D00" "phi0.05D02" "phi0.05D04" "phi0.05D06" "phi0.05D08"
## [6] "phi0.05D010" "phi0.05D012" "phi0.05D014" "phi0.05D016" "phi0.05D018"
## [11] "phi0.05D020" "phi0.05D022" "phi0.06D00" "phi0.06D02" "phi0.06D04"
## [16] "phi0.06D06" "phi0.06D08" "phi0.06D010" "phi0.06D012" "phi0.06D014"
## [21] "phi0.06D016" "phi0.06D018" "phi0.06D020" "phi0.06D022" "phi0.07D00"
## [26] "phi0.07D02" "phi0.07D04" "phi0.07D06" "phi0.07D08" "phi0.07D010"
## [31] "phi0.07D012" "phi0.07D014" "phi0.07D016" "phi0.07D018" "phi0.07D020"
## [36] "phi0.07D022" "phi0.08D00" "phi0.08D02" "phi0.08D04" "phi0.08D06"
## [41] "phi0.08D08" "phi0.08D010" "phi0.08D012" "phi0.08D014" "phi0.08D016"
## [46] "phi0.08D018" "phi0.08D020" "phi0.08D022" "phi0.09D00" "phi0.09D02"
## [51] "phi0.09D04" "phi0.09D06" "phi0.09D08" "phi0.09D010" "phi0.09D012"
## [56] "phi0.09D014" "phi0.09D016" "phi0.09D018" "phi0.09D020" "phi0.09D022"
## [61] "phi0.1D00" "phi0.1D02" "phi0.1D04" "phi0.1D06" "phi0.1D08"
## [66] "phi0.1D010" "phi0.1D012" "phi0.1D014" "phi0.1D016" "phi0.1D018"
## [71] "phi0.1D020" "phi0.1D022" "phi0.11D00" "phi0.11D02" "phi0.11D04"
## [76] "phi0.11D06" "phi0.11D08" "phi0.11D010" "phi0.11D012" "phi0.11D014"
## [81] "phi0.11D016" "phi0.11D018" "phi0.11D020" "phi0.11D022" "phi0.12D00"
## [86] "phi0.12D02" "phi0.12D04" "phi0.12D06" "phi0.12D08" "phi0.12D010"
## [91] "phi0.12D012" "phi0.12D014" "phi0.12D016" "phi0.12D018" "phi0.12D020"
## [96] "phi0.12D022" "phi0.13D00" "phi0.13D02" "phi0.13D04" "phi0.13D06"
## [101] "phi0.13D08" "phi0.13D010" "phi0.13D012" "phi0.13D014" "phi0.13D016"
## [106] "phi0.13D018" "phi0.13D020" "phi0.13D022" "phi0.14D00" "phi0.14D02"
## [111] "phi0.14D04" "phi0.14D06" "phi0.14D08" "phi0.14D010" "phi0.14D012"
## [116] "phi0.14D014" "phi0.14D016" "phi0.14D018" "phi0.14D020" "phi0.14D022"
## [121] "phi0.15D00" "phi0.15D02" "phi0.15D04" "phi0.15D06" "phi0.15D08"
## [126] "phi0.15D010" "phi0.15D012" "phi0.15D014" "phi0.15D016" "phi0.15D018"
## [131] "phi0.15D020" "phi0.15D022" "timestep"
There are 133 attributes here. Each column corresponds to a unique simulation, and the value in each row is the number of interconnected clusters in the system at that time step. The attribute name encodes the simulation configuration; for instance, “phi0.12D02” means \(\phi = 0.12\) and \(D_0 = 2\). A function is written here to extract these values from the attribute names.
extract_values <- function(att_name) {
  options(digits=3)
  # split the attribute name into single characters, e.g. "phi0.05D02"
  sp.name = unlist(strsplit(att_name, ""))
  if (sp.name[7] != "D"){
    # two-decimal phi (e.g. "0.05"): characters 4-7, D0 value starts at character 9
    phi = as.double(paste(sp.name[4:7], collapse = ""))
    D0 = as.double(paste(sp.name[9:length(sp.name)], collapse = ""))
  }
  if (sp.name[7] == "D"){
    # one-decimal phi (e.g. "0.1"): characters 4-6, D0 value starts at character 8
    phi = as.double(paste(sp.name[4:6], collapse = ""))
    D0 = as.double(paste(sp.name[8:length(sp.name)], collapse = ""))
  }
  return(c(phi,D0))
}
Let’s see if the function works properly:
print(paste("phi and D0 for ", names(gel.data)[20], " are: ",extract_values(names(gel.data)[20])[1], " and ",extract_values(names(gel.data)[20])[2]))
## [1] "phi and D0 for phi0.06D014 are: 0.06 and 14"
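For reference, the same parsing can be done more compactly with a single regular expression in base R. This is just a sketch (parse_config is a hypothetical helper, not used in the rest of this analysis):

```r
# Sketch: parse "phi<value>D0<value>" names with sub() instead of
# character-by-character splitting.
parse_config <- function(att_name) {
  phi <- as.double(sub("^phi([0-9.]+)D0[0-9]+$", "\\1", att_name))
  D0  <- as.double(sub("^phi[0-9.]+D0([0-9]+)$", "\\1", att_name))
  c(phi, D0)
}
parse_config("phi0.06D014")  # c(0.06, 14)
```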
Create a matrix to store the final state of the simulation for each configuration (the first column holds the volume fractions).
phi_list = c(5:15) * 0.01
D0_list = c(1:11) * 2
phase_space = matrix(0, nrow = length(phi_list), ncol = (length(D0_list)+1))
phase_space[,1] = phi_list
Now, let’s take a look at some basic details of the data. Use the summary() function to observe 10 random attributes:
summary(gel.data[,sample(ncol(gel.data),10)])
## phi0.09D04 phi0.09D018 phi0.12D016 phi0.11D04 phi0.12D022
## Min. : 4 Min. : 1 Min. : 1 Min. : 1 Min. : 2
## 1st Qu.: 5 1st Qu.: 2 1st Qu.: 1 1st Qu.: 2 1st Qu.: 4
## Median : 7 Median : 8 Median : 1 Median : 2 Median : 6
## Mean : 40 Mean : 34 Mean : 15 Mean : 20 Mean : 19
## 3rd Qu.: 19 3rd Qu.: 14 3rd Qu.: 3 3rd Qu.: 8 3rd Qu.: 8
## Max. :991 Max. :984 Max. :979 Max. :984 Max. :968
## NA's :1953 NA's :1841 NA's :1089 NA's :1492 NA's :1408
## phi0.12D06 phi0.05D02 phi0.12D02 phi0.15D018 phi0.08D02
## Min. : 1 Min. : 72 Min. : 2 Min. : 1 Min. : 8
## 1st Qu.: 1 1st Qu.:111 1st Qu.: 3 1st Qu.: 1 1st Qu.: 12
## Median : 2 Median :153 Median : 3 Median : 1 Median : 21
## Mean : 15 Mean :217 Mean : 17 Mean : 9 Mean : 67
## 3rd Qu.: 4 3rd Qu.:256 3rd Qu.: 4 3rd Qu.: 1 3rd Qu.: 56
## Max. :977 Max. :992 Max. :988 Max. :950 Max. :998
## NA's :1415 NA's :3121 NA's :971 NA's :798 NA's :2272
We can see that some attributes contain a large number of NA values. The reason is that each simulation ran for a different number of time steps; therefore, every attribute except the one with the largest number of time steps has trailing NA values. Before dealing with the NA values, let’s take a look at the correlation between the attributes using the chart.Correlation() function from the PerformanceAnalytics library. Here, a few random attributes are selected.
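These trailing NAs can be reproduced with a tiny synthetic example (the names and lengths below are made up): padding runs of different lengths to a common length produces exactly this pattern, and colSums(is.na(...)) counts the NAs per column.

```r
# Two hypothetical "simulations" of different lengths, padded with NA
# to the length of the longest one -- just like the real data frame.
runA <- c(1000, 400, 150)            # short run: 3 recorded steps
runB <- c(1000, 600, 300, 120, 40)   # long run: 5 recorded steps
padded <- data.frame(
  A = c(runA, rep(NA, length(runB) - length(runA))),
  B = runB
)
colSums(is.na(padded))  # A has 2 trailing NAs, B has none
```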
set.seed(123245)
suppressMessages(suppressWarnings(library(PerformanceAnalytics)))
chart.Correlation((gel.data[c(sample(50,5))]), histogram=TRUE, pch=19)
There are two important things to notice in the data.
Let’s plot simulations of the same \(\phi\) and see if it gives us anything:
set.seed(12345)
suppressMessages(suppressWarnings(library(PerformanceAnalytics)))
chart.Correlation((gel.data[c(1:12)]), histogram=TRUE, pch=19)
Again, these high correlations are misleading; what we care about is the exponential decay at the initial time steps. Interestingly, the case with \(D_0 = 0\) behaves completely differently from the others. Dynamically, this makes sense: the attraction changes everything, and we expect no gelation without it. Since we don’t need these cases, let’s remove the simulations with no attraction from the data.
D0_cols = c()
counter = 0
for (i in c(1:(ncol(gel.data)-1))){
temp.D0 = extract_values(names(gel.data)[i])[2]
if (temp.D0==0){
counter = counter + 1
D0_cols[counter] = i
}
}
gel.data = gel.data[,-D0_cols]
Now, let’s fix one attraction and look at the correlations across different volume fractions. Also take the logarithm of the data to make it less skewed.
chart.Correlation(log(gel.data[(c(0:4)*11 + 11)]), histogram=TRUE, pch=19)
Take a closer look at those correlation lines: what we actually care about are the initial times, where the correlations are not linear. The linear parts correspond to the steady state and do not matter here, because we care about the decay process. To avoid wasted computation and misleading correlations, the data for each simulation is trimmed where the number of clusters drops below 5 (using \(0.005\, N_{total}\) as the exponential relaxation cut-off).
# loop over all attributes: trim each simulation at the cut-off and
# record its final (steady-state) number of clusters
max.timestep = 0
for (sim in c(1:(ncol(gel.data) - 1))){
  cut_off_vals = which(gel.data[,sim] > 5)
  tmp.phi = extract_values(names(gel.data)[sim])[1]
  tmp.D0 = extract_values(names(gel.data)[sim])[2]
  last.num = max(which(is.na(gel.data[,sim]) == FALSE))
  # average the last ~100 recorded values; note the parentheses in
  # (last.num-100):last.num -- without them, `:` binds before `-`
  phase_space[which(phi_list==tmp.phi), (which(D0_list==tmp.D0)+1)] =
    mean(gel.data[(last.num-100):last.num, sim], na.rm = TRUE)
  if (length(cut_off_vals) > 0){
    if(max(cut_off_vals)>max.timestep) {max.timestep = max(cut_off_vals)}
    gel.data[-cut_off_vals,sim] = NA
  }
}
phase_space[,-1] = phase_space[,-1]/1000
Before going further, plot the phase space. We need to plot the final state of the mixture with respect to the system configuration.
suppressMessages(suppressWarnings(library(gplots)))
my_palette <- colorRampPalette(c("red", "yellow", "green"))(n = 299)
col_breaks = c(seq(0,.03,length=100), # for red
seq(0.031,0.1,length=100), # for yellow
seq(0.11,1,length=100)) # for green
heatmap.2(phase_space[,-1],main = "Phase Space",
notecol=NULL,density.info="none", col=my_palette,
breaks=col_breaks,Colv="NA", Rowv="NA",
xlab = "Attraction", ylab= "Volume fraction")
The NA values are left untouched: they begin at some point in time for each attribute and remain unassigned through the end of that simulation, and it is easier to work with them if they are simply left in place.
Let’s normalize the number of clusters by the maximum possible number of clusters, \(N = 1000\) (don’t normalize the timestep attribute).
#gel.data[,-ncol(gel.data)] = gel.data[,-ncol(gel.data)] / 1000
\(80 \%\) of the simulations are selected as the training set and the rest are used for validation.
tr.indc = sample(ncol(gel.data[-1]), floor(0.8 * ncol(gel.data)))
gel.tr1 = gel.data[tr.indc]
gel.vl1 = gel.data[-tr.indc]
Fit a linear regression to model the gel relaxation over time for all simulations. Each gel relaxes exponentially: \(N(t) = N_0 e^{-t/\tau} \therefore \log(N(t)) = \log(N_0) - t/\tau\). Therefore, what we feed to the linear regression is \(y = ax + b\), where \(y \equiv \log(N(t))\), \(x \equiv t\), \(a \equiv -1/\tau\) (slope), and \(b \equiv \log(N_0)\) (intercept).
relaxation_matrix = matrix(0, nrow = length(phi_list), ncol = length(D0_list))
relaxation_matrix2 = matrix(0, nrow = length(phi_list), ncol = length(D0_list))
R_sq = matrix(0, nrow = length(phi_list), ncol = length(D0_list))
#gel.data = gel.data[-c(1:10),]
set.seed(12345)
for (i in c(1:(ncol(gel.data)-1))){
# y = a * exp(- t / tau) = a * exp( b * t), where b = 1/tau
# log(y) = log(a) + b * t
train_rows = which(gel.data[,i] > 200)
lm1 = lm(I(log(gel.data[train_rows,i]) - 0) ~ I(timestep[train_rows] - log(1000)), data = gel.data)#, offset = phi0.05D02
tmp.phi = extract_values(names(gel.data)[i])[1]
tmp.D0 = extract_values(names(gel.data)[i])[2]
row.ind = which(phi_list==tmp.phi); col.ind = which(D0_list==tmp.D0)
relaxation_matrix[row.ind, col.ind] = -1/lm1$coefficients[2]
relaxation_matrix2[row.ind, col.ind] = lm1$coefficients[1]
R_sq[row.ind, col.ind] = summary(lm1)$r.squared
}
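As a sanity check on this log-linear approach, we can generate synthetic exponential-decay data with a known relaxation time and confirm that lm() recovers it from the slope (the numbers below are made up for illustration):

```r
# Synthetic decay N(t) = N0 * exp(-t / tau_true); in log space this is
# log(N) = log(N0) - t / tau_true, so the fitted slope equals -1/tau_true.
t_syn    <- seq(0, 500, by = 10)
tau_true <- 120
N_syn    <- 1000 * exp(-t_syn / tau_true)
fit      <- lm(log(N_syn) ~ t_syn)
tau_hat  <- -1 / coef(fit)[2]   # recovered relaxation time
round(unname(tau_hat))          # 120
```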
First, let’s see what the minimum R-squared value is across all the models:
min(R_sq)
## [1] 0.923
Well, this means that even the least accurate model is still a very good approximation of the actual data. We must still be aware that this might be due to the skewed nature of our data; some tuning will be discussed later.
Now, let’s take a look at the heatmap of the calculated relaxation times,
my_palette <- colorRampPalette(c("red", "yellow", "green"))(n = 299)
col_breaks = c(seq(0,50,length=100), # for red
seq(51,300,length=100), # for yellow
seq(301,1000,length=100)) # for green
heatmap.2(relaxation_matrix,main = "Relaxation coefficient",
notecol=NULL,density.info="none", col=my_palette,
breaks=col_breaks,Colv="NA", Rowv="NA",
xlab = "Attraction->", ylab= "<-Volume fraction")
We can clearly see that the relaxation time decreases from top to bottom, which implies that gels with a higher volume fraction evolve faster. This is consistent with what we already observed. Yet there is no specific pattern in the “attraction” direction; it seems that in this specific regime, attraction does not play a major role in the gelation process. Let’s see how the volume fraction affects the relaxation time across simulations: group the simulations with the same attraction together, then plot each group’s relaxation time with respect to volume fraction.
par(mfrow=c(1,2))
matplot(x=phi_list, y=(relaxation_matrix), type = "p", lty = 1:5, lwd = 1, pch = NULL,
col = 1:6, cex = NULL, bg = NA,
xlab = expression(paste("Volume fraction (",phi,")")),
ylab =expression(paste("Relaxation time (",tau,")")),
xlim = NULL, ylim = NULL)
matplot(x=log(phi_list), y=log(relaxation_matrix), type = "p", lty = 1:5, lwd = 1, pch = NULL,
col = 1:6, cex = NULL, bg = NA,
xlab = expression(paste("log(",phi,")")),
ylab =expression(paste(" log(",tau,")")),
xlim = NULL, ylim = NULL)
Interestingly, all the curves collapse onto each other almost perfectly, showing that simulations with different attractions are dynamically the same. You can also see in the second figure that the log of the relaxation time varies roughly linearly. Now, let’s use linear regression to find the relation between volume fraction and relaxation time. Here’s the mathematical setup: \(\tau = a_0 e^{-\phi / \phi_0} \therefore \log(\tau) = \log(a_0) - \phi / \phi_0\). Therefore, this is what we feed to the linear regression model: \(y = ax + b\), where \(y \equiv \log(\tau)\), \(x \equiv \phi\), \(a \equiv -1/\phi_0\) (slope), and \(b \equiv \log(a_0)\) (intercept).
tau_phi = matrix(0, nrow = length(phi_list), ncol = 5)
tau_phi[,1] = phi_list
set.seed(12345)
for (i in c(1:(ncol(relaxation_matrix)))){
lm2 = lm(log(relaxation_matrix[,i])~I(phi_list - 0))
tau_phi[i,2] = -1/lm2$coefficients[2]
tau_phi[i,3] = exp(lm2$coefficients[1])
tau_phi[,4] = mean(relaxation_matrix[i,])
}
tau_phi[,5] = tau_phi[,3] * exp(-tau_phi[,1] / tau_phi[,2])
So far, we have learned that \(N(t) = N_0 e^{-t / \tau}\), where \(N_0 = 1000\) (i.e., the total number of particles) and \(\tau\) is a function of \(\phi\) such that \(\tau(\phi) = \tau_0 e^{-\phi / \phi_0}\), where:
N_0 = 1000
tau_0 = mean(tau_phi[,3])  # average over all attraction groups (i was a stale loop index)
tau_0
## [1] 756
and :
phi_0 = mean(tau_phi[,2])  # average over all attraction groups (i was a stale loop index)
phi_0
## [1] 0.0414
Therefore, we can write the relaxation as: \(N(t) = N_0 e^{-t / (\tau_0 e^{-\phi / \phi_0})}\).
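Putting the pieces together, the fitted decay can be wrapped in a small helper (a sketch; predict_clusters is a hypothetical name, and the defaults use the \(\tau_0 = 756\) and \(\phi_0 = 0.0414\) values reported above):

```r
# N(t) = N0 * exp(-t / tau(phi)), with tau(phi) = tau0 * exp(-phi / phi0).
# tau0 = 756 and phi0 = 0.0414 are the values estimated above.
predict_clusters <- function(t, phi, N0 = 1000, tau0 = 756, phi0 = 0.0414) {
  tau <- tau0 * exp(-phi / phi0)
  N0 * exp(-t / tau)
}
predict_clusters(0, 0.05)  # 1000 clusters at t = 0, i.e. all singletons
```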
In this part, the MSE (the mean of the squared prediction errors) and the MAD (the mean of the absolute prediction errors) are used to measure the accuracy of our model across the different simulations.
MSE = rep(0, length(phi_list))
MAD = rep(0, length(phi_list))
for (sim in c(1:ncol(gel.data[-1]))){
tmp.phi = extract_values(names(gel.data)[sim])[1]
n.pred = length(which(is.na(gel.data[,sim]) == FALSE))
actual = gel.data[c(1:n.pred),sim]
i = which(phi_list == tmp.phi)
Time = gel.data$timestep[c(1:n.pred)]
pred = N_0 * exp(-Time / (tau_0 * exp(-tmp.phi / phi_0))) + 1
MSE[i] = MSE[i] + mean((pred/1000 - actual/1000)^2,na.rm = TRUE)
MAD[i] = MAD[i] + mean(abs(pred/1000 - actual/1000),na.rm = TRUE)
}
After calculating the errors, plot them for different volume fractions:
min_x = min(phi_list);max_x = max(phi_list)
min_y = min(MAD,MSE);max_y = max(MAD,MSE)
plot(c(min_x,max_x),c(min_y,max_y),xlab = expression(phi),ylab = "MSE/MAD")
points(phi_list,MSE,col="cyan",lwd=2.5)
points(phi_list,MAD,col="red",lwd=2.5)
legend(min_x,0.9*max_y, c("MSE","MAD"),
lty=c(1,1), lwd=c(2.5,2.5),col=c("cyan","red"))
This plot suggests that the MSE is always within a good error range, but the MAD increases at the low and high ends of the volume-fraction window. Let’s plot some of the actual data against the predictions to see why this happens:
par(mfrow=c(1,3)); Time = gel.data$timestep
for (i in c(1,5,11)){
target = gel.data[,(i*11)]; n.m = length(which(is.na(target)==FALSE))
phi.t = extract_values(names(gel.data)[i*11])[1]
pred = N_0 * exp(-Time / (tau_0 * exp(-phi.t / phi_0)))
min_x = min(Time); max_x = n.m;
min_y = min(target, na.rm = TRUE);max_y = max(target, na.rm = TRUE);
plot(c(min_x,n.m),c(0,1000),type="n"
, xlab = "Time",ylab = "Number of Clusters",
main = phi.t)
points(Time,target,col="red",lwd=2.5)
points(Time,pred,col="cyan",lwd=2.5)
legend(0.2*max_x,0.99*max_y, c("Actual","Pred"),
lty=c(1,1), lwd=c(2.5,2.5),col=c("red","cyan"))
}
This figure shows why the error is so large for the low volume fractions: the regression model tries to fit as many data points as possible, and in this problem the relaxation is quick, which makes the model converge to zero faster than it should. Therefore, we need to further tune the data frame and the model to get better results. Let’s train the model again, this time starting the learning after the very sharp initial steps (start the training set when the number of clusters drops below 100).
new.tr = 100
min.tr = 5
relaxation_matrix = matrix(0, nrow = length(phi_list), ncol = length(D0_list))
relaxation_matrix2 = matrix(0, nrow = length(phi_list), ncol = length(D0_list))
R_sq = matrix(0, nrow = length(phi_list), ncol = length(D0_list))
#gel.data = gel.data[-c(1:10),]
set.seed(12345)
for (i in c(1:(ncol(gel.data)-1))){
train_rows = which(gel.data[,i] > new.tr);
train_rows = train_rows[-c(1:min.tr)]
lm1 = lm(I(log(gel.data[train_rows,i]) - 0) ~ I(timestep[train_rows] - log(1000)), data = gel.data)#, offset = phi0.05D02
tmp.phi = extract_values(names(gel.data)[i])[1]
tmp.D0 = extract_values(names(gel.data)[i])[2]
row.ind = which(phi_list==tmp.phi); col.ind = which(D0_list==tmp.D0)
relaxation_matrix[row.ind, col.ind] = -1/lm1$coefficients[2]
relaxation_matrix2[row.ind, col.ind] = lm1$coefficients[1]
R_sq[row.ind, col.ind] = summary(lm1)$r.squared
}
tau_phi = matrix(0, nrow = length(phi_list), ncol = 5)
tau_phi[,1] = phi_list
set.seed(12345)
for (i in c(1:(ncol(relaxation_matrix)))){
lm2 = lm(log(relaxation_matrix[,i])~I(phi_list - 0))
tau_phi[i,2] = -1/lm2$coefficients[2]
tau_phi[i,3] = exp(lm2$coefficients[1])
tau_phi[,4] = mean(relaxation_matrix[i,])
}
tau_phi[,5] = tau_phi[,3] * exp(-tau_phi[,1] / tau_phi[,2])
N_0 = 1000
tau_0 = mean(tau_phi[,3])  # average over all attraction groups
phi_0 = mean(tau_phi[,2])
MSE = rep(0, length(phi_list))
MAD = rep(0, length(phi_list))
for (sim in c(1:ncol(gel.data[-1]))){
tmp.phi = extract_values(names(gel.data)[sim])[1]
n.start = length(which(gel.data[,sim] > new.tr))
n.pred = length(which(is.na(gel.data[,sim]) == FALSE))
actual = gel.data[c(n.start:n.pred),sim]
i = which(phi_list == tmp.phi)
Time = gel.data$timestep[c(n.start:n.pred)]
pred = N_0 * exp(-Time / (tau_0 * exp(-tmp.phi / phi_0))) + 1
MSE[i] = MSE[i] + mean((pred/1000 - actual/1000)^2,na.rm = TRUE)
MAD[i] = MAD[i] + mean(abs(pred/1000 - actual/1000),na.rm = TRUE)
}
min_x = min(phi_list);max_x = max(phi_list)
min_y = min(MAD,MSE);max_y = max(MAD,MSE)
plot(c(min_x,max_x),c(min_y,max_y),xlab = expression(phi),ylab = "MSE/MAD")
points(phi_list,MSE,col="cyan",lwd=2.5)
points(phi_list,MAD,col="red",lwd=2.5)
legend(min_x,0.9*max_y, c("MSE","MAD"),
lty=c(1,1), lwd=c(2.5,2.5),col=c("cyan","red"))
After tuning the model, we can see that the error values are significantly improved. The reason is that with this tuning, the fit is better both at the initial steps and at the trivial, steady-state later steps.
Conclusion of the first part: in this specific regime, attraction has no effect on the evolution of gelation. To study the effects of volume fraction, the relaxation times were first calculated, and the exponential decay of relaxation time with volume fraction was also modeled. Therefore, one can predict how long a gel takes to prepare if the mixture properties are known.
In an effort to extend this work to a dynamically more complicated system, a specific regime was simulated in which the gelation evolution results from an interplay between volume fraction and interparticle attraction. Let’s start the second part:
In the first part of the project, the interparticle attraction did not play an important role. In this part, new simulations are performed so that attraction becomes involved in the phase space. This means we need a more complicated model. The following steps are taken:
- an SVM to classify simulations as gel/not gel using the \(\phi\) and \(D_0\) features;
- a neural network for the same purpose.

Read the new data:
gel.data2 = read.csv('N_Clusters2.csv', header = TRUE)
Initialize the lists of simulation features for future use:
phi_list = c(0.05,0.1,0.15,0.2,0.25,0.3,0.35,0.4,0.45,0.5,0.55,0.6)
D0_list = c(1,4,7,10,13,16,19,22)
Use the summary() function to observe 10 random attributes:
summary(gel.data2[,sample(ncol(gel.data2),10)])
## phi0.1D016 phi0.35D07 phi0.5D022 phi0.6D04 phi0.6D010
## Min. :178 Min. : 29 Min. : 1.00 Min. : 1.0 Min. :1.00
## 1st Qu.:229 1st Qu.:158 1st Qu.: 1.00 1st Qu.: 1.0 1st Qu.:1.00
## Median :259 Median :168 Median : 1.00 Median : 1.0 Median :1.00
## Mean :275 Mean :168 Mean : 1.01 Mean : 1.1 Mean :1.26
## 3rd Qu.:315 3rd Qu.:178 3rd Qu.: 1.00 3rd Qu.: 1.0 3rd Qu.:1.00
## Max. :918 Max. :221 Max. :20.00 Max. :11.0 Max. :8.00
## phi0.15D022 phi0.4D04 phi0.5D07 phi0.55D022 phi0.05D04
## Min. : 1 Min. : 22 Min. : 1.00 Min. :1 Min. :946
## 1st Qu.: 1 1st Qu.:174 1st Qu.: 5.00 1st Qu.:1 1st Qu.:963
## Median : 3 Median :184 Median : 6.00 Median :1 Median :967
## Mean : 8 Mean :184 Mean : 6.39 Mean :1 Mean :966
## 3rd Qu.: 5 3rd Qu.:194 3rd Qu.: 8.00 3rd Qu.:1 3rd Qu.:970
## Max. :680 Max. :234 Max. :24.00 Max. :9 Max. :984
This time, the data set is already clean: all attributes have the same length. Let’s make sure there are no NA values:
which(is.na(gel.data2) == TRUE)
## integer(0)
Great! Now, let’s take a look at the correlations:
set.seed(12345)
chart.Correlation(log(gel.data2[,c(97,1,8,45,80)]), histogram=TRUE, pch=19)
Clearly, we’re now dealing with a more complicated system, and several cases must be treated cautiously: a simulation where no network forms at all (“phi0.05D01”), where the relaxation time is effectively infinite (\(\tau = \infty\)); a case of gradual relaxation (“phi0.05D022”), just like what we saw in part 1; and a case where relaxation happens so fast that it appears almost as a vertical line in the correlation plot. Make a data frame of the features and the class attribute for future use:
z = rep(0,(ncol(gel.data2)-1))
c.a = rep(FALSE,(ncol(gel.data2)-1))
phase.df = data.frame("phi"=z,"D_0"=z,"N_C"=z,"is.gel"=c.a,"svm"=c.a,"nn"=c.a,"regression"=c.a)
The phase of each simulation is computed here, using the mean() of the number of clusters over the last time steps.
tr.indc = sample(ncol(gel.data2[-1]), floor(0.8 * ncol(gel.data2)))
gel.tr2 = gel.data2[tr.indc]
gel.vl2 = gel.data2[-tr.indc]
max.timestep = 0;
phase_space2 = matrix(0, nrow = length(phi_list), ncol = (length(D0_list)))
bool_phase_space = phase_space2
for (sim in c(1:(ncol(gel.data2) -1 ) )){
cut_off_vals = which(gel.data2[,sim] > 0)
tmp.phi = extract_values(names(gel.data2)[sim])[1]
tmp.D0 = extract_values(names(gel.data2)[sim])[2]
phase.df$phi[sim] = tmp.phi;phase.df$D_0[sim] = tmp.D0;
last.num = max(which(is.na(gel.data2[,sim]) == FALSE))
ps.x = which(phi_list==tmp.phi); ps.y=which(D0_list==tmp.D0)
phase_space2[ps.x, ps.y] =
mean(gel.data2[(last.num-500):last.num, sim], na.rm = TRUE)  # note the parentheses around last.num-500
phase.df$N_C[sim]=phase_space2[ps.x, ps.y]
if (phase_space2[ps.x, ps.y] < 15){
bool_phase_space[ps.x,ps.y] = 1
phase.df$is.gel[sim]=TRUE
}
if (length(cut_off_vals) > 0){
if(max(cut_off_vals)>max.timestep) {max.timestep = max(cut_off_vals)}
gel.data2[-cut_off_vals,sim] = NA
}
}
Before going further, plot the phase space. We need to plot the final state of the mixture with respect to the system configuration.
my_palette <- colorRampPalette(c("red", "yellow", "green"))(n = 299)
col_breaks = c(seq(0,3,length=100), # for red
seq(4,100,length=100), # for yellow
seq(101,1000,length=100)) # for green
heatmap.2(phase_space2, main = "Phase Space",  # unlike phase_space, there is no phi column to drop here
notecol=NULL,density.info="none", col=my_palette,
breaks=col_breaks,Colv="NA", Rowv="NA",
xlab = "Attraction->", ylab= "<-Volume fraction")
In this heatmap, red blocks represent a completely formed gel and green represents a liquid-like mixture. We can see that both volume fraction and attraction now determine the phase space; in fact, there are cases with the same volume fraction where an increase in attraction alone leads to gelation.
Make an SVM model to find the gelation boundary and plot it. Store the SVM predictions so that we can compare the models later.
set.seed(12345)
suppressMessages(suppressWarnings(library(kernlab)))
svm.m <- ksvm(is.gel ~ phi + D_0, data = phase.df, type="C-svc",C=10, kernel = "rbfdot")
plot(svm.m, data = phase.df, grid = 200 ,C = 10, kernel = "rbfdot")
phase.df$svm = ifelse(predict(svm.m,phase.df[,c(1,2)])==2, TRUE,FALSE)
Here, the SVM shows nicely how non-linear the phase space becomes at higher volume fractions. The rbfdot kernel turns out to be a good fit and a cost of 10 is selected, although changing the cost does not affect the results significantly. Next, a neural network is trained on the same data; the models are evaluated and compared at the end of the work. First, let’s see how accurate the SVM model is:
agreement_svm <- phase.df$svm == phase.df$is.gel
prop.table(table(agreement_svm))
## agreement_svm
## TRUE
## 1
\(100\%\) accuracy! This is not too surprising, mainly because the accuracy is measured on the same configurations the model was trained on.
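One way to get a fairer accuracy estimate is a held-out train/test split. As a minimal illustration, here is a sketch on synthetic two-feature data using base R logistic regression (all names and numbers below are made up, and glm stands in for the SVM to keep the sketch dependency-free):

```r
# Illustrative only: a noisy two-feature classification task,
# fit on 80% of the rows and scored on the held-out 20%.
set.seed(1)
n <- 200
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- as.integer(d$x1 + d$x2 + rnorm(n, sd = 0.2) > 1)
tr <- sample(n, 160)                               # 80% training rows
fit <- glm(y ~ x1 + x2, family = binomial, data = d[tr, ])
p <- predict(fit, newdata = d[-tr, ], type = "response")
held_out_acc <- mean((p > 0.5) == (d$y[-tr] == 1)) # accuracy on unseen rows
held_out_acc  # typically around 0.9 for this toy setup
```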
set.seed(12345)
suppressMessages(suppressWarnings(library(neuralnet)))
nn.m <- neuralnet(is.gel ~ phi + D_0, data = phase.df)
nn.result <- neuralnet::compute(nn.m, phase.df[,c(1,2)])
phase.df$nn <- ifelse(nn.result$net.result>0.5,TRUE,FALSE)
agreement_nn <- phase.df$nn == phase.df$is.gel
prop.table(table(agreement_nn))
## agreement_nn
## FALSE TRUE
## 0.0208 0.9792
The neural network also does a great job predicting the boundary; given the very small difference, it would not be fair to say the SVM worked better.
Here, we aim to build a model that relates the number of clusters at each time to \(\phi\) and \(D_0\). The desired equation has the form \(N_C(t) = f(t,\phi,D_0)\). If we manage to pull this off, we can calculate the number of clusters at an arbitrary time and say whether an interconnected network of particles (i.e., a gel) has formed. First, let’s look at the correlation between the final number of clusters and the volume fraction/attraction separately.
par(mfrow=c(1,2)); Time = gel.data2$timestep
min_x = 0;max_x = max(phi_list);min_y = 0;max_y = 1000
plot(c(min_x,max_x),c(min_y,max_y),xlab = expression(phi),ylab = "Number of Clusters")
for (i in c(1:length(D0_list))){
indcs = c(1:length(phi_list)) * 8 - 8
points(phi_list,phase.df$N_C[(indcs+i)],col=i,lwd=2.5)
}
par(new = FALSE)
plot(c(min_x,22),c(min_y,max_y),xlab = expression(D0),ylab = "Number of Clusters")
for (i in c(1:length(phi_list))){
indcs = c(1:length(D0_list)) + (i - 1) * 8
points(D0_list,phase.df$N_C[indcs],col=i,lwd=2.5)
}
Before training the model, note that when \(\phi > 0.45\), gelation always happens regardless of attraction; therefore, we must exclude these cases from our regression model to prevent errors.
to.rm = which(phase.df$phi >= 0.5)  # phi = 0.5, 0.55, 0.6 always gel
phase.df.2 = phase.df[-to.rm,]
In this part, we try to predict the final number of clusters using the given features.
phi_list = phi_list[-c(12,11,10)]
par(mfrow=c(1,2)); Time = gel.data2$timestep
min_x = 0;max_x = max(phi_list);min_y = 0;max_y = 1000
plot(c(min_x,max_x),c(min_y,max_y),xlab = expression(phi),ylab = "Number of Clusters")
for (i in c(1:length(D0_list))){
indcs = c(1:length(phi_list)) * 8 - 8 + i
points(phi_list,(phase.df.2$N_C[indcs]),col=i,lwd=2.5)
model <- lm(phase.df$N_C[indcs] ~ poly(phi_list,7))
new_data = phase.df$phi[indcs]
predicted.intervals <- predict(model,data = new_data,interval='confidence',
level=0.99)
lines(phi_list,(predicted.intervals[,1]),col=i,lwd=3)
}
par(new = FALSE)
plot(c(min_x,22),c(min_y,max_y),xlab = expression(D0),ylab = "Number of Clusters")
for (i in c(1:length(phi_list))){
indcs = c(1:length(D0_list)) + (i - 1) * 8
points(D0_list,(phase.df.2$N_C[indcs]),col=i,lwd=2.5)
model <- lm(phase.df$N_C[indcs] ~ poly(D0_list,2))
new_data = phase.df$D_0[indcs]
predicted.intervals <- predict(model,data = new_data,interval='confidence',
level=0.9)
lines(D0_list,(predicted.intervals[,1]),col=i,lwd=3)
}
After several rounds of trial and error, we find that, unfortunately, in this physical regime the meaningful simulations are limited to a very narrow range of features. This makes our useful data even smaller than it already is; much more simulation data would be needed to make this work. The problem is left off at this point.
Having failed to make the regression model work for the second part, the results of the SVM and the neural network are compared here. The comparison is not highly significant, since both models performed well on this problem; still, the SVM worked almost perfectly for this specific data, while the neural network was more sensitive to changes in the algorithm/data.
accuracy.df = data.frame("svm"=0,"nn"=0)
n.tot = nrow(phase.df.2)
accuracy.df$svm = length(which(phase.df.2$svm == phase.df.2$is.gel))/n.tot
accuracy.df$nn = length(which(phase.df.2$nn == phase.df.2$is.gel))/n.tot
barplot(c(accuracy.df$svm, accuracy.df$nn), xlab = "SVM vs NN", ylab = "Accuracy")